Pupil-Indexed Arousal and Psychometric Sensitivity in Older Adults

Author

Mohammad Dastgheib

1 Introduction

1.1 Physical–Cognitive Dual-Task Interactions in Everyday Life and Aging

Everyday life frequently requires people to perform cognitively demanding tasks while simultaneously exerting physical effort. Surgeons must maintain continuous muscle engagement while making fine perceptual discriminations, athletes coordinate complex motor sequences while monitoring environmental cues, and older adults navigate dual-task situations—such as walking while monitoring traffic or carrying groceries while having a conversation—in which physical and cognitive demands interact dynamically. These real-world scenarios highlight a fundamental challenge: when multiple task domains compete for limited cognitive resources, performance in one or both domains may suffer (Pashler, 1994; Wickens, 2008). This challenge takes on particular significance in aging, as older adults face both declining cognitive capacity and increased physiological costs of effortful engagement (Salthouse, 1996; Verhaeghen et al., 2003).

A growing literature suggests that physical–cognitive interactions are mediated by shared arousal systems, particularly the locus coeruleus–noradrenergic (LC–NE) system, which modulates sensory gain, evidence accumulation, and response caution as a function of task demands and internal state (Aston-Jones & Cohen, 2005; Jepma et al., 2012; Mather & Harley, 2016). Understanding how physical effort modulates cognitive performance—and how this relationship changes with age—requires moving beyond simple outcome measures (overall accuracy or mean reaction time) to characterize how effort-induced arousal alters the fundamental processes underlying perceptual decision-making. The present study addresses this question by combining a physical–cognitive dual-task paradigm with pupillometry and psychometric function analysis to examine how physical effort–induced arousal relates to perceptual sensitivity in older adults.

1.2 Psychometric Functions: Quantifying Perceptual Sensitivity

1.2.1 Psychometric Functions as Measurement Models

Psychometric functions (PFs) provide a principled framework for characterizing how continuous stimulus intensity relates to perceptual judgments. Rather than collapsing performance across stimulus levels, PFs model the full intensity–response relationship, revealing how the probability of a given response changes as stimulus differences become more or less discriminable. In the present same–different discrimination paradigm, the PF maps a continuous intensity dimension (e.g., frequency offset or contrast difference) onto the probability of responding “different” (Green et al., 1966; Prins et al., 2016; Wichmann & Hill, 2001). This perspective aligns with a measurement-model view of perception: PFs can be derived from latent sensory evidence that is corrupted by internal noise and then transformed into an overt response via a decision rule (Prins et al., 2016). Within this framework, changes in PF steepness are typically interpreted as changes in discriminability (signal-to-noise separation), whereas horizontal shifts can reflect changes in response tendency or criterion, and thus require more cautious interpretation when the task permits bias (Macmillan & Creelman, 2005).

1.2.2 Model Specification and Parameter Interpretation

Psychometric functions take sigmoidal (S-shaped) forms and can be expressed using several closely related link families, including the cumulative normal (probit), logistic (logit), and Weibull functions. For a binary response model, one common formulation is the probit PF:

\[P(\text{"different"} | X) = \Phi\left(\frac{X - \alpha}{\beta}\right) \tag{1}\]

where \(X\) denotes stimulus intensity, \(\Phi(\cdot)\) is the cumulative standard normal distribution, \(\alpha\) is a location parameter corresponding to the stimulus intensity at which responding reaches a criterion level (often defined at a midpoint between asymptotes), and \(\beta\) is a scale/spread parameter. Under this parameterization, PF steepness is inversely related to \(\beta\): smaller \(\beta\) values yield steeper functions (higher discriminability), whereas larger \(\beta\) values yield shallower functions (lower discriminability). The steepness directly reflects discriminability—how rapidly the probability of choosing “different” changes as stimulus intensity increases. Under the probit parameterization, the standard deviation of the underlying sensory noise distribution is \(\sigma = \beta\) (the scale parameter directly corresponds to the noise SD). An analogous logit specification replaces \(\Phi(\cdot)\) with the logistic function, and in practice different sigmoid families typically yield very similar fits, with differences primarily reflecting parameterization and threshold conventions (Prins et al., 2016). For example, the Weibull’s conventional threshold parameter corresponds to approximately 0.632 on the cumulative response scale, whereas Quick/log-Quick formulations correspond to 0.5; therefore, threshold values should always be interpreted relative to the criterion at which they are defined (Prins et al., 2016).
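As an illustration of how \(\beta\) controls steepness in Equation 1, the following is a minimal Python sketch. The function names and the numerical values (a midpoint at 20 and scale values of 5 and 15) are hypothetical and chosen only for demonstration; they are not estimates from the present data.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_pf(x: float, alpha: float, beta: float) -> float:
    """Equation 1: P("different" | X) = Phi((X - alpha) / beta)."""
    return norm_cdf((x - alpha) / beta)


def midpoint_slope(beta: float) -> float:
    """Derivative of Equation 1 at X = alpha, i.e. phi(0) / beta."""
    return 1.0 / (beta * math.sqrt(2.0 * math.pi))


# Hypothetical parameter values, for illustration only.
alpha = 20.0
for beta in (5.0, 15.0):
    p_above = probit_pf(alpha + 5.0, alpha, beta)
    print(f"beta={beta}: P(alpha + 5)={p_above:.3f}, "
          f"slope at midpoint={midpoint_slope(beta):.4f}")
```

In this sketch the smaller \(\beta\) gives both the larger slope at the midpoint and the higher response probability just above it, which is the sense in which steepness is inversely related to \(\beta\).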

In many psychophysical applications, PFs are described using a four-parameter form that includes asymptotes: a lower asymptote (guess rate, \(\gamma\)) and an upper asymptote defined by the lapse rate (\(\lambda\)), in addition to location and scale parameters (Prins et al., 2016). In forced-choice accuracy tasks, \(\gamma\) is constrained by task structure (e.g., \(\gamma=0.5\) in 2AFC; \(\gamma=1/M\) in M-AFC). In judgment-based tasks such as same–different responding, the lower asymptote reflects baseline “different” responding (false-alarm tendency) rather than a fixed chance level. The lapse rate \(\lambda\) captures stimulus-independent errors (e.g., attentional lapses, motor errors, or track loss) that can prevent performance from reaching the upper bound even at high intensities. Importantly, psychophysical guidance emphasizes that assuming \(\lambda=0\) when lapses are present can bias threshold and slope estimates, and that lapse is difficult to estimate reliably without adequate constraints or priors (Prins et al., 2016; Wichmann & Hill, 2001). The full four-parameter formulation is:

\[P(\text{"different"} | X) = \gamma + (1 - \gamma - \lambda) \cdot \Phi\left(\frac{X - \alpha}{\beta}\right)\]

These considerations motivate interpreting location parameters cautiously when criterion and lapses may vary across conditions, while treating changes in steepness as a cleaner index of sensitivity. The location parameter \(\alpha\) represents the stimulus intensity at a criterion level, often defined at the midpoint between the lower and upper asymptotes (i.e., halfway between the guess rate \(\gamma\) and \(1-\lambda\)), which typically corresponds to approximately 50% “different” responses on the response probability scale. In appearance/judgment tasks where 50% indicates subjective equality (e.g., “brighter vs. darker”), this location is often called a point of subjective equality (PSE). In same–different discrimination tasks, however, the location parameter can reflect a mix of sensitivity and decision criterion/bias (Prins et al., 2016), so “threshold” or “midpoint” is the more standard terminology, and changes in \(\alpha\) may reflect shifts in decision criterion or response bias rather than true sensitivity changes, particularly when criterion can drift with arousal or effort conditions. By contrast, in forced-choice accuracy PFs the lower asymptote is fixed by the number of alternatives (e.g., \(\gamma = 0.5\) in 2AFC), and “threshold” is conventionally defined at a criterion such as 75% correct. Sensitivity is better reflected in the steepness of the intensity–response relationship (how rapidly probability changes with intensity), which is why the present analyses emphasize slope and its modulation by arousal rather than threshold alone. This distinction between location (criterion) and sensitivity (steepness) parameters is illustrated in Figure 1.
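The role of the asymptotes in the four-parameter form can be checked numerically. The sketch below is illustrative only: the function name and the parameter values (\(\gamma = 0.10\), \(\lambda = 0.04\)) are hypothetical, not fitted values.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pf_four_param(x, alpha, beta, gamma, lam):
    """Four-parameter PF: gamma + (1 - gamma - lambda) * Phi((x - alpha) / beta)."""
    return gamma + (1.0 - gamma - lam) * norm_cdf((x - alpha) / beta)


# Hypothetical parameters: 10% baseline "different" responding, 4% lapse rate.
alpha, beta, gamma, lam = 20.0, 5.0, 0.10, 0.04

p_low = pf_four_param(-1e6, alpha, beta, gamma, lam)   # approaches gamma
p_high = pf_four_param(1e6, alpha, beta, gamma, lam)   # approaches 1 - lambda
p_mid = pf_four_param(alpha, alpha, beta, gamma, lam)  # halfway between the two

print(f"lower={p_low:.2f}, upper={p_high:.2f}, at alpha={p_mid:.2f}")
```

With these assumed values the response probability at \(X = \alpha\) is 0.53 rather than exactly 0.50, which is why the midpoint is described as only approximately 50% on the response probability scale.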

The psychometric function is closely related to signal detection theory (SDT), where the location parameter (\(\alpha\)) corresponds to the decision criterion and the scale parameter (\(\beta\)) relates to \(d'\) (sensitivity). Changes in location reflect criterion shifts (response bias), while changes in steepness (inversely related to \(\beta\)) reflect true sensitivity changes (Macmillan & Creelman, 2005). Some psychophysical procedures (e.g., yes/no detection tasks) are particularly susceptible to criterion shifts, making proportion correct a poor sensitivity measure when response bias varies across conditions or individuals (Prins et al., 2016). SDT measures like \(d'\) separate sensitivity from bias, providing a more robust index of discriminability. In the present “same–different” judgment task, criterion issues are acknowledged: the location parameter (\(\alpha\)) is interpreted cautiously as potentially reflecting criterion/bias shifts, whereas sensitivity is emphasized via the steepness parameter (\(\beta\) or the intensity coefficient in the GLMM). The hierarchical GLMM framework includes random intercepts to account for individual differences in baseline response tendency (criterion), while the key interaction term tests arousal effects on sensitivity (steepness) rather than merely on response bias. Figure 1 illustrates this distinction between location (criterion) and sensitivity (steepness) parameters.

Figure 1: Psychometric function parameters: location vs sensitivity

1.2.3 Implementation in the Present Study

This chapter leverages two complementary PF approaches that serve distinct purposes. First, descriptive PF parameters (thresholds and slopes) that provide the behavioral backbone for Chapter 2 were estimated previously using Psignifit 4 in MATLAB (Schütt et al., 2016), which supports robust estimation and can account for overdispersion. A Weibull PF was fit to each participant’s proportion of “different” responses across stimulus levels within each task and effort condition. Because the Weibull function is undefined at zero, the 0-offset intensity level was replaced with half of the smallest non-zero intensity level. Guess and lapse parameters were allowed to vary freely during fitting, and threshold and slope were derived at the effective midpoint—halfway between the estimated guess rate \(\gamma\) and \(1-\lambda\)—yielding a criterion comparable to a 50% point on the effective response scale.
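The zero-intensity substitution described above can be sketched as follows. This is a plain-Python illustration, not Psignifit code; the function name is hypothetical, and the example levels are the auditory offsets plus a 0 Hz (same) level. The sketch also assumes that the practical issue is the logarithm of intensity being undefined at zero.

```python
import math


def replace_zero_intensity(levels):
    """Replace a 0 level with half of the smallest non-zero level."""
    smallest_nonzero = min(v for v in levels if v > 0)
    return [smallest_nonzero / 2.0 if v == 0 else v for v in levels]


# Auditory frequency offsets in Hz, with 0 standing for "same" trials.
offsets_hz = [0, 8, 16, 32, 64]
adjusted = replace_zero_intensity(offsets_hz)

# Assumption: the fit operates on log intensity, which is finite after the swap.
log_levels = [math.log(v) for v in adjusted]

print(adjusted)  # the 0 Hz level becomes 4.0
```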

Second, the primary inferential analyses in the present chapter use a hierarchical generalized linear mixed model (GLMM) with a probit link to model trialwise \(P(\text{"different"} \mid X)\). In a probit-link GLMM, the linear predictor operates on a latent standard-normal (\(z\)) evidence scale, and the intensity coefficient governs the steepness of the intensity–response relationship in that latent space. The critical interaction between stimulus intensity and within-person pupil state (\(X_{ij} \times P^{(\text{state})}_{ij}\)) therefore tests whether moment-to-moment arousal modulates sensitivity (steepness) rather than merely shifting response tendency (criterion). Although the trial-level GLMM does not explicitly estimate lapse parameters, lapse-like contamination is minimized through stringent pupil quality tiers and robustness checks, preserving statistical efficiency while addressing psychophysical concerns about stimulus-independent error. The probit link function is used in the GLMM for consistency with the latent evidence interpretation in signal detection theory, where the cumulative normal distribution naturally maps onto \(z\)-scaled evidence. In the present study, stimulus intensity is modeled on its original scale (frequency offset in Hz for auditory, contrast difference for visual), and the continuous-intensity GLMM framework avoids binning regardless of the underlying scale.
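The correspondence between the probit PF in Equation 1 and a probit-link linear predictor can be made explicit for the simplest case of an intercept and a single intensity coefficient. The symbols \(b_0\) and \(b_1\) are introduced here only for illustration, to avoid confusion with the scale parameter \(\beta\):

\[\Phi^{-1}\big(P(\text{"different"} \mid X)\big) = b_0 + b_1 X = \frac{X - \alpha}{\beta}, \qquad \alpha = -\frac{b_0}{b_1}, \quad \beta = \frac{1}{b_1}\]

Under this simplified mapping, a larger intensity coefficient corresponds to a smaller \(\beta\) and therefore a steeper function, whereas a change in the intercept alone moves \(\alpha\) without changing steepness.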

1.2.4 Why Continuous-Intensity Modeling

Modeling the continuous intensity–response relationship offers clear advantages over collapsing trials into discrete “easy/hard” bins: it preserves information, yields parameters with interpretable psychological meaning, and improves statistical efficiency by using the full range of stimulus values. Most importantly for the present dissertation aims, steepness provides a direct test of how arousal alters psychometric sensitivity (that is, whether arousal changes how rapidly the probability of a “different” response rises with stimulus intensity) rather than only changing average responding.

1.3 Arousal, Effort, and Perceptual Sensitivity

The relationship between arousal and cognitive performance has been formalized in several influential frameworks. The Yerkes–Dodson law proposes an inverted-U function in which performance improves with increasing arousal up to an optimal point, beyond which additional arousal—particularly when experienced as stress or anxiety—impairs performance (Yerkes & Dodson, 1908). Adaptive Gain Theory (AGT) provides a mechanistic account centered on the locus coeruleus–norepinephrine (LC–NE) system, linking tonic (baseline) and phasic (event-evoked) LC activity to shifts in neural gain and, ultimately, task performance (Aston-Jones & Cohen, 2005). Under AGT, optimal performance is expected when tonic LC activity is moderate and phasic responses to task-relevant events are robust, supporting focused attention and efficient responding.

In aging, arousal–performance relationships may be altered in ways that complicate these canonical predictions. Evidence suggests that older adults may exhibit a leftward shift or compression of the inverted-U curve, such that peak performance occurs at lower levels of objective demand relative to younger adults (Mather & Harley, 2016; Mikneviciute et al., 2022). As a result, conditions that are neutral or even beneficial for younger adults may push older adults into supra-optimal arousal states, placing them on the descending limb of the curve and producing performance decrements (Huang & Clewett, 2024; Mather & Harley, 2016). This aging-related vulnerability motivates testing whether experimentally induced arousal—here, physical effort—selectively disrupts the quality of perceptual evidence and the sensitivity of discrimination.

Physical effort provides a controlled method for manipulating arousal state in a way that is experimentally separable from stimulus difficulty. Sustained isometric handgrip at moderate-to-high intensities (e.g., ~30–40% MVC) elicits a robust pressor and sympathoexcitatory response—raising arterial pressure and heart rate and increasing sympathetic outflow—whose magnitude scales with contraction intensity and remains present in older adults. At the neural level, static handgrip and post-exercise ischemia recruit brainstem and cortical regions involved in autonomic control and effort, offering a plausible central substrate for effort-induced arousal shifts that may interact with perceptual decision processes (Lalande et al., 2014; Mark et al., 1985; Sander et al., 2010; Toska, 2010).

At the same time, physical effort may influence performance through mechanisms that are not purely “arousal,” including resource-based interference. Limited-capacity and multiple-resource accounts predict that concurrent physical effort can compete with cognitive task performance when total demands exceed available capacity or when tasks draw on overlapping resource pools (Wickens, 2008). Supporting this possibility, Azer et al. (2023) reported that older adults showed reduced accuracy in a visual working memory task while maintaining moderate handgrip (30% MVC), whereas younger adults were relatively unaffected. This pattern is consistent with the idea that older adults may be more susceptible to combined physical-cognitive demands, either because resources are depleted more readily or because arousal becomes dysregulated under challenge (Verhaeghen et al., 2003). Importantly, these “resource competition” and “arousal regulation” perspectives are not mutually exclusive, and the present study is designed to help separate their contributions by modeling trialwise arousal (pupil state) alongside effort condition.

These frameworks motivate specific predictions for psychometric sensitivity. In a same–different task, the location (midpoint) parameter of the psychometric function can shift if arousal alters decision criterion or response bias—for example, by making participants more conservative or liberal in endorsing “different.” Such location shifts do not necessarily indicate true changes in discriminability, and in same–different paradigms location can reflect a mixture of sensitivity and criterion effects; therefore, threshold/location effects are interpreted cautiously. By contrast, the steepness of the psychometric function (inversely related to the scale parameter) provides a cleaner index of sensitivity because it reflects how rapidly response probability changes with stimulus intensity. If effort-induced arousal degrades sensory signal quality or increases internal noise, psychometric functions should become shallower (reduced sensitivity), requiring larger stimulus differences for reliable discrimination. Conversely, if moderate arousal optimizes neural gain and improves signal-to-noise ratio, functions may become steeper (enhanced sensitivity). Critically, these effects may be nonlinear: moderate arousal could improve sensitivity, while supra-optimal arousal—anticipated to occur more readily in older adults—could reduce sensitivity. For this reason, the present analyses emphasize steepness-related effects (and their modulation by pupil-indexed arousal) over threshold shifts alone.

1.4 Pupillometry as a Window into Arousal Dynamics

The LC–NE system is a central regulator of arousal and cognitive state. As the primary source of norepinephrine to the forebrain, LC–NE projections modulate neural gain and the signal-to-noise ratio of cortical processing (Aston-Jones & Cohen, 2005). AGT posits that task performance depends on a balance between tonic LC activity and phasic LC bursts, such that moderate tonic baseline paired with strong phasic responses supports focused attention and effective behavioral control (Gilzenrat et al., 2010).

Pupillometry provides a practical, noninvasive index of arousal dynamics that is often linked to LC–NE activity. Pupil diameter covaries with arousal-related neuromodulatory activity and can be characterized at two levels. First, baseline (tonic) pupil diameter reflects relatively sustained arousal state and has been associated with tonic LC activity and general alertness (Alnæs et al., 2014; Gilzenrat et al., 2010). Second, task-evoked pupil responses (TEPRs) reflect event-linked changes often interpreted as phasic arousal and mobilization of mental effort during task execution (Beatty, 1982; Kahneman & Beatty, 1966). TEPRs typically peak on the order of 1–2 seconds after stimulus onset and can be quantified using peak amplitude, mean dilation over a defined epoch, or the area under the curve (AUC) of the baseline-corrected signal, which jointly captures amplitude and duration. TEPR magnitude has been linked to task difficulty, cognitive load, surprise, and decision confidence, and in discrimination contexts larger TEPRs often accompany more difficult decisions, consistent with increased recruitment of control resources (Duchowski et al., 2018; Preuschoff et al., 2011; vanbergen2021?). The neural pathway linking LC to pupil diameter is thought to involve LC projections to the Edinger–Westphal nucleus, which drives pupil constriction via the parasympathetic pathway, together with sympathetic pathways that drive dilation (Joshi et al., 2016; McDougal & Gamlin, 2010; Murphy et al., 2014).

Interpretation of pupil diameter nevertheless requires care because it is a multiply determined signal shaped by interacting neural and autonomic pathways. While LC–NE contributions to pupil fluctuations are central to the present theoretical framing, other structures and pathways can also influence pupil dynamics, and measurement artifacts can distort pupil estimates (Papesh & Goldinger, 2024). In particular, eye position and saccade-related effects can alter apparent pupil size (e.g., pupil foreshortening error and post-saccadic pupil constrictions), and blink-related missingness can bias mean or baseline estimates if not handled carefully (Hershman et al., 2024; Laeng & Mathôt, 2024). In cognitive tasks, TEPRs reflect event-linked changes that unfold with a physiological delay rather than instantaneously; pupil changes are typically detectable approximately 200–250 ms after task events, with responses tied to specific events such as stimulus presentation, response execution, or feedback (Hershman et al., 2024). This temporal delay justifies the use of epoch-based analysis windows and AUC metrics that capture the full time course of the response. These considerations motivate pairing theory-driven pupil metrics with explicit preprocessing and data-quality procedures designed to minimize artifactual variance and ensure that observed effects reflect task-evoked physiology rather than measurement confounds.

In aging, LC–NE regulation may change in ways that are directly relevant to arousal–performance coupling. Structural degradation of the LC has been observed in older adults (Mather & Harley, 2016), and functional compensatory patterns may emerge, including chronically elevated tonic arousal or altered phasic responsiveness to challenge (Lee et al., 2018; Mather et al., 2016). Such compensatory dynamics may support performance up to a point, but could also increase vulnerability to supra-optimal arousal, leading to distractibility and degraded information processing under high demand (Aston-Jones & Cohen, 2005; Eldar et al., 2013). Pupillometry studies in aging report heterogeneous patterns—often larger baseline pupils coupled with reduced task-evoked responses—suggesting altered tonic-phasic balance (Granholm et al., 2007; vanbergen2021?). Physical effort manipulations therefore provide an opportunity to test whether externally induced arousal shifts pupil dynamics and whether trialwise pupil state predicts changes in psychometric sensitivity.

1.5 Dual-Task Context and Competing Mechanisms

Dual-task paradigms have long been used to understand limits of attention and performance under concurrent demands. Limited-capacity and resource-competition accounts propose that performance decrements arise when combined task demands exceed available resources, particularly when tasks draw on overlapping pools of attention or executive control (Kahneman, 1973; Navon, 1984, 1985; Wickens, 2008). Multiple Resource Theory further predicts that interference depends on similarity across resource dimensions (e.g., spatial vs. verbal demands), such that tasks that share processing channels are more likely to interfere (Wickens, 2008). Applied to physical–cognitive dual tasks, interference may arise through sustained attentional requirements for maintaining force, physiological arousal effects that can help or harm performance depending on state, and motor–cognitive competition that affects response selection and execution (Proctor, 2012; Woollacott & Shumway-Cook, 2002). Age-related increases in dual-task costs are well documented and may reflect reduced reserve, altered executive control, or changes in arousal regulation (Beurskens et al., 2014; Verhaeghen et al., 2003).

The present study explicitly recognizes that effort effects may reflect both generic resource competition and arousal-mediated mechanisms. Resource-competition models predict effort-related performance decrements as demands rise, whereas arousal-based accounts (including LC–NE mechanisms) predict that the quality of information processing may change as a function of arousal state, potentially independent of capacity limits. The key advantage of pairing pupillometry with psychometric modeling is that it allows these mechanisms to be separated empirically: effort condition provides an experimental manipulation, while trialwise pupil state indexes moment-to-moment arousal, and the psychometric framework distinguishes sensitivity changes (steepness) from criterion shifts (intercept/location). Accordingly, effort main effects in the GLMM may reflect a mixture of resource and arousal influences, whereas a significant stimulus intensity × pupil-state interaction would more specifically support arousal-linked modulation of sensitivity.

1.6 Linking Arousal to Psychometric Sensitivity: Analytic Strategy

Traditional analyses of arousal–performance often collapse continuous stimulus intensity into coarse difficulty bins (e.g., “easy” vs. “hard”), which discards information and obscures how arousal modulates the intensity–response relationship itself. The present chapter instead models trialwise responses across the continuous intensity dimension, enabling direct estimation of psychometric sensitivity and its modulation by arousal.

A central methodological challenge in pupil–behavior analyses is separating within-person “state” fluctuations from between-person “trait” differences. To do so, each pupil metric (\(P_{ij}\)) is decomposed into a subject-mean component and a trial-wise deviation using within-subject centering:

\[P^{(\text{trait})}_{j} = \overline{P}_{j}, \qquad P^{(\text{state})}_{ij} = P_{ij} - \overline{P}_{j} \tag{2}\]

This decomposition enables two distinct questions to be tested in the same model: whether individuals with higher average arousal differ in sensitivity (trait effects), and whether trials on which an individual is more aroused than usual show altered sensitivity (state effects). Importantly, this approach prevents conflating between- and within-person patterns—an especially relevant concern in aging where trait differences may be influenced by health, LC integrity, or cognitive reserve (Enders, 2013; Hoffman, 2015). For example, if high-arousal individuals tend to have poorer performance for reasons unrelated to trial-wise fluctuations (e.g., underlying LC degeneration or comorbidities), that pattern will be captured at the trait level rather than falsely attributed to trial-wise changes in arousal. Figure 2 illustrates this decomposition visually, showing how raw pupil metrics are separated into trait (between-person) and state (within-person) components.

Figure 2: Decomposition of pupil metrics into state and trait components
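The within-subject centering in Equation 2 can be sketched in a few lines of Python. The record layout (dictionaries with `subject` and `pupil` keys), the function name, and the example values are hypothetical and serve only to illustrate the decomposition; this is not the analysis code used in the study.

```python
from collections import defaultdict
from statistics import mean


def decompose_pupil(trials):
    """Add trait (subject mean) and state (trial minus subject mean) components."""
    by_subject = defaultdict(list)
    for trial in trials:
        by_subject[trial["subject"]].append(trial["pupil"])

    # Trait component: one value per subject (Equation 2, left-hand term).
    trait = {subject: mean(values) for subject, values in by_subject.items()}

    # State component: trial-wise deviation from that subject's own mean.
    return [
        {
            **trial,
            "pupil_trait": trait[trial["subject"]],
            "pupil_state": trial["pupil"] - trait[trial["subject"]],
        }
        for trial in trials
    ]


# Hypothetical pupil metric values for two subjects.
trials = [
    {"subject": "s01", "pupil": 0.2},
    {"subject": "s01", "pupil": 0.4},
    {"subject": "s02", "pupil": 1.0},
    {"subject": "s02", "pupil": 1.4},
]
decomposed = decompose_pupil(trials)
for row in decomposed:
    print(row["subject"], round(row["pupil_trait"], 3), round(row["pupil_state"], 3))
```

By construction the state values average to zero within each subject, so any between-person difference in overall pupil size is carried entirely by the trait term.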

Primary inference is implemented using hierarchical generalized linear mixed models (GLMMs) with a probit link, modeling trialwise \(P(Y_{ij}=1)\) as a function of continuous stimulus intensity, effort condition, modality, and pupil-indexed arousal while accounting for individual differences via random effects:

\[ \begin{aligned} \text{probit}(P(Y_{ij}=1)) &= \beta_0 + \beta_1 X_{ij} + \beta_2 \text{Effort}_{ij} + \beta_3 \text{Modality}_{ij} + \beta_4 P^{(\text{state})}_{ij} \\ &\quad + \beta_5 (X_{ij} \times P^{(\text{state})}_{ij}) + \beta_6 P^{(\text{trait})}_{j} + u_{0j} + u_{1j}X_{ij} \end{aligned} \tag{3}\]

where \(Y_{ij}\) is the binary outcome on trial \(i\) for subject \(j\), \(X_{ij}\) is the continuous stimulus intensity, \(\text{Effort}_{ij}\) codes the handgrip effort condition (High vs. Low), and \(\text{Modality}_{ij}\) codes the sensory modality (auditory vs. visual). \(P^{(\text{state})}_{ij}\) and \(P^{(\text{trait})}_{j}\) are the state and trait components of the pupil metric, respectively, and \(u_{0j}\) and \(u_{1j}\) are subject-specific random intercepts and random slopes on \(X_{ij}\), allowing each participant to have their own baseline performance level and intensity sensitivity.

In this framework, the key parameter is the stimulus intensity × pupil-state interaction (\(\beta_5\)), which tests whether moment-to-moment arousal modulates psychometric sensitivity. A significant interaction indicates that the intensity–response slope differs as a function of trialwise arousal—consistent with arousal-linked changes in sensitivity—whereas changes limited to intercept/location terms would be more consistent with shifts in response tendency or criterion.
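To make the role of \(\beta_5\) concrete, the sketch below evaluates only the fixed-effects part of Equation 3, with random effects set to zero. The coefficient values are hypothetical and chosen solely to show the direction of the effect; the dictionary keys `b0` to `b6` stand for \(\beta_0\) to \(\beta_6\), and the function names are illustrative.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal CDF, the inverse of the probit link."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_different(x, pupil_state, coef, effort=0.0, modality=0.0, pupil_trait=0.0):
    """Fixed-effects prediction from Equation 3 (u_0j = u_1j = 0)."""
    eta = (
        coef["b0"]
        + coef["b1"] * x
        + coef["b2"] * effort
        + coef["b3"] * modality
        + coef["b4"] * pupil_state
        + coef["b5"] * x * pupil_state
        + coef["b6"] * pupil_trait
    )
    return norm_cdf(eta)


def intensity_slope(pupil_state, coef):
    """Latent-scale slope on intensity: d(eta)/dX = b1 + b5 * P_state."""
    return coef["b1"] + coef["b5"] * pupil_state


# Hypothetical coefficients with a negative intensity x pupil-state interaction.
coef = {"b0": -1.5, "b1": 0.06, "b2": 0.0, "b3": 0.0,
        "b4": 0.0, "b5": -0.01, "b6": 0.0}

for state in (-1.0, 0.0, 1.0):
    print(f"pupil_state={state:+.1f}: slope={intensity_slope(state, coef):.3f}, "
          f"P(different | X=32)={p_different(32.0, state, coef):.3f}")
```

With a negative \(\beta_5\), as in this assumed example, the latent-scale slope shrinks as pupil state rises above the participant's own mean, which corresponds to a shallower psychometric function on higher-arousal trials.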

1.7 Research Questions and Hypotheses

Based on the theoretical rationale above, the present study addresses four research questions. First, we test how high (40% MVC) versus low (5% MVC) effort affects behavioral psychometric function parameters (thresholds and slopes) in older adults performing auditory and visual same–different discrimination. We expect midpoint thresholds to be higher under high effort, acknowledging that threshold shifts may reflect sensitivity and/or criterion changes, and we predict shallower slopes under high effort, consistent with reduced discriminability under elevated arousal.

Second, we test whether high effort increases tonic and task-evoked pupil dynamics relative to low effort. We predict larger baseline pupil diameter and larger TEPR (AUC) under high effort, indicating elevated tonic and phasic arousal, respectively.

Third, we test whether trialwise phasic arousal predicts psychometric sensitivity when intensity is modeled continuously. The primary hypothesis is that the stimulus intensity × pupil-state interaction will be negative (shallower slopes on higher-arousal trials), consistent with supra-optimal arousal increasing noise or degrading signal quality; an alternative hypothesis is that moderate arousal could enhance sensitivity, yielding a positive interaction. We also expect weaker or null associations for pupil-trait predictors, given that trait effects may be confounded with other individual differences such as LC integrity or cognitive reserve.

Finally, we test whether individuals with larger effort-evoked arousal changes (Δpupil) exhibit larger effort-related changes in psychometric parameters (Δthreshold, Δslope), predicting that stronger physiological reactivity is associated with larger behavioral decrements.

2 Methods

2.1 Participants

Approximately 50 healthy older adults aged 65 years or older completed the dual-task paradigm (target N ≈ 50 after quality control; the final N may vary modestly).

2.2 Task Paradigm

Participants performed same–different discrimination tasks in two modalities while simultaneously squeezing a dynamometer:

Effort Conditions:

  • Low effort: 5% of maximum voluntary contraction (MVC)

  • High effort: 40% MVC

Stimuli:

  • Auditory Discrimination Task (ADT): 1000 Hz base tones with frequency offsets of +8, +16, +32, or +64 Hz

  • Visual Discrimination Task (VDT): Oriented Gabor patches with contrast differences of +0.06, +0.12, +0.24, or +0.48

Trial Structure: Each trial followed this sequence:

  1. Pre-squeeze baseline: 3 seconds
  2. Sustained squeeze period: 3 seconds (Low or High effort)
  3. Standard stimulus: 0.1 seconds
  4. Inter-stimulus interval (ISI): 0.5 seconds
  5. Target stimulus: 0.1 seconds
  6. Response window: Participants indicated whether the target was “same” or “different” from the standard

2.3 Pupillometry

Pupil diameter was recorded continuously throughout each trial using an eye-tracking system. The data were processed using a MATLAB preprocessing pipeline that segments trials relative to squeeze onset, flags blinks and track loss events, computes window-specific validity metrics, and implements baseline correction using pre-event windows.

2.3.1 Pupil Features

Two primary features were computed on each trial:

  1. Total AUC: Baseline-corrected area under the pupil curve over a global trial window, indexing overall arousal during concurrent physical effort.

  2. Cognitive pupil metric (primary): Baseline-corrected task-evoked measure computed in a fixed 1.20-second stimulus-locked window from 4.85s to 6.05s (target onset + 0.50s to target onset + 1.70s, relative to squeeze onset). This window captures the task-evoked pupil response (TEPR) peak, which typically occurs ~1 second after stimulus onset (Mathôt & Vilotijević, 2023; bauer2021?), while avoiding early reflex components and reducing confounding by response time. The primary metric used in analyses is mean dilation (cog_mean = cog_auc / window_duration), which removes duration confounds and is preferred over raw AUC.
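The relationship between the AUC and the mean-dilation metric can be illustrated with a short Python sketch. It assumes subtractive baseline correction using the mean of a pre-event window and trapezoidal integration at the 250 Hz sampling rate; the actual MATLAB pipeline may differ in these details, and the function names and sample values are hypothetical.

```python
def baseline_correct(samples, baseline_samples):
    """Subtract the mean of a pre-event baseline window (assumed method)."""
    baseline = sum(baseline_samples) / len(baseline_samples)
    return [s - baseline for s in samples]


def trapezoid_auc(values, sampling_rate_hz):
    """Area under the curve by the trapezoid rule, in (pupil units x seconds)."""
    dt = 1.0 / sampling_rate_hz
    return sum((a + b) / 2.0 * dt for a, b in zip(values, values[1:]))


def cog_mean(cog_auc, window_duration_s):
    """Mean dilation: AUC divided by the window duration."""
    return cog_auc / window_duration_s


# Hypothetical trial: constant 0.2-unit dilation above baseline for 1.20 s.
sampling_rate_hz = 250.0
baseline_window = [3.0] * 50
cognitive_window = [3.2] * 301  # 301 samples span 300 intervals = 1.20 s

corrected = baseline_correct(cognitive_window, baseline_window)
auc = trapezoid_auc(corrected, sampling_rate_hz)
mean_dilation = cog_mean(auc, window_duration_s=1.20)
print(f"cog_auc={auc:.3f}, cog_mean={mean_dilation:.3f}")
```

Dividing by the window duration returns the metric to pupil units, so trials whose usable windows differ slightly in length remain comparable.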

2.3.2 Gap-Aware Quality Control Metrics

In addition to percentage-validity thresholds, we implemented gap-aware quality control metrics to identify trials with problematic missing data patterns. Percentage-validity thresholds are necessary but not sufficient: a trial can have acceptable overall validity (e.g., 60%) but contain large contiguous gaps that distort AUC estimates. Following recommendations from Kret & Sjak-Shie (2019), we computed five gap-aware metrics for each trial:

  1. cog_auc_max_gap_ms: Maximum contiguous missing segment (milliseconds) within the cognitive window. Large gaps (>250ms) should not be interpolated and can severely distort AUC estimates even when percent-valid looks acceptable (Kret & Sjak-Shie, 2019).
  2. cog_window_duration: Actual duration of the cognitive window (seconds) after accounting for data availability. For the fixed 1.20s window, this should be close to 1.20s.
  3. cog_auc_n_valid: Number of valid (non-NA) samples in the cognitive window. At 250 Hz sampling rate, the 1.20s window should contain ~300 samples.
  4. cog_auc_n_segments: Number of contiguous valid segments (1 = continuous, >1 = fragmented). Highly fragmented trials may have unreliable AUC estimates.
  5. cog_auc_prop_valid: Proportion of valid samples (n_valid / expected_samples_at_250Hz).

These metrics help distinguish between genuine low pupil dilation (high quality, small gaps, low mean dilation) and data quality artifacts (low quality or large gaps, low mean dilation), ensuring that valid physiological data are retained while problematic trials are excluded.
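
As a rough illustration of how four of these metrics could be derived from a single cognitive-window trace, the following Python sketch treats missing samples as `None`. The function name and input format are hypothetical; cog_window_duration is omitted because its exact computation is not specified above.

```python
from typing import Dict, List, Optional

SAMPLE_RATE_HZ = 250  # sampling rate stated in the text
WINDOW_S = 1.20       # fixed cognitive window length stated in the text


def gap_metrics(samples: List[Optional[float]]) -> Dict[str, float]:
    """Gap-aware metrics for one window; None marks a missing sample."""
    ms_per_sample = 1000.0 / SAMPLE_RATE_HZ
    expected = round(WINDOW_S * SAMPLE_RATE_HZ)  # about 300 samples

    n_valid = 0
    n_segments = 0
    max_gap = 0       # longest run of missing samples seen so far
    current_gap = 0   # length of the run of missing samples in progress
    prev_valid = False

    for sample in samples:
        valid = sample is not None
        if valid:
            n_valid += 1
            if not prev_valid:
                n_segments += 1  # a new contiguous valid segment starts here
            current_gap = 0
        else:
            current_gap += 1
            max_gap = max(max_gap, current_gap)
        prev_valid = valid

    return {
        "cog_auc_max_gap_ms": max_gap * ms_per_sample,
        "cog_auc_n_valid": n_valid,
        "cog_auc_n_segments": n_segments,
        "cog_auc_prop_valid": n_valid / expected,
    }
```

A trace with 100 valid samples, 75 missing samples, and 125 valid samples has 75% validity, which passes a 60% threshold, yet it contains a 300ms gap and two segments. This is the case that percentage-validity thresholds alone would miss.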

2.3.3 Quality Tiers

Pupil-based analyses used pre-specified quality tiers that combine percentage-validity thresholds with gap-aware filtering:

  • Primary tier:
    • B1 baseline quality ≥ 0.50 AND cognitive window quality ≥ 0.60
    • Maximum gap ≤ 250ms (gap-aware filter)
    • Window duration ≥ 0.90s (75% of 1.20s window)
    • Valid samples ≥ 240 (80% of expected ~300 samples)
  • Lenient tier:
    • B1 baseline quality ≥ 0.50 AND cognitive window quality ≥ 0.50
    • Maximum gap ≤ 250ms (gap-aware filter)
    • Window duration ≥ 0.75s (relaxed duration threshold)
  • Strict tier:
    • B1 baseline quality ≥ 0.50 AND cognitive window quality ≥ 0.70
    • Maximum gap ≤ 250ms (gap-aware filter)
    • Window duration ≥ 0.90s AND valid samples ≥ 240 (strict thresholds)
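
The tier definitions above can be expressed as a small rule table. The following Python sketch is one way to do so; the thresholds are taken from the list, while the dictionary keys `baseline_quality` and `cog_quality` and the function name are hypothetical stand-ins for whatever the pipeline actually calls these fields.

```python
from typing import Dict, Optional

MAX_GAP_MS = 250.0           # gap-aware filter shared by all tiers
MIN_BASELINE_QUALITY = 0.50  # B1 baseline threshold shared by all tiers

# Tier-specific thresholds copied from the list above.
TIERS: Dict[str, Dict[str, Optional[float]]] = {
    "lenient": {"min_cog_quality": 0.50, "min_duration_s": 0.75, "min_valid": None},
    "primary": {"min_cog_quality": 0.60, "min_duration_s": 0.90, "min_valid": 240},
    "strict": {"min_cog_quality": 0.70, "min_duration_s": 0.90, "min_valid": 240},
}


def passes_tier(trial: Dict[str, float], tier: str) -> bool:
    """Return True if a trial meets every criterion of the named tier."""
    rule = TIERS[tier]
    ok = (
        trial["baseline_quality"] >= MIN_BASELINE_QUALITY
        and trial["cog_quality"] >= rule["min_cog_quality"]
        and trial["cog_auc_max_gap_ms"] <= MAX_GAP_MS
        and trial["cog_window_duration"] >= rule["min_duration_s"]
    )
    # The lenient tier has no valid-sample floor.
    if rule["min_valid"] is not None:
        ok = ok and trial["cog_auc_n_valid"] >= rule["min_valid"]
    return ok
```

Written this way, the primary and strict tiers differ only in the cognitive window quality threshold (0.60 vs. 0.70), and the lenient tier relaxes both that threshold and the duration requirement.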

Rationale for Quality Tiers: Lapse-Rate Considerations in Aging

Psychophysical modeling guidance emphasizes that assuming lapse rate (\(\lambda\)) = 0 can distort threshold and slope estimates if lapses are present but unaccounted for (Prins et al., 2016). This concern is particularly relevant in older adult populations, where attentional lapses, track loss events, and data quality issues may be more frequent. The quality tier strategy addresses this by excluding trials with poor pupil data quality (low validity or large gaps), thereby reducing lapse-like contamination in the psychometric function estimates. The tiering approach tests robustness of conclusions across stricter inclusion criteria: if key effects (e.g., the stimulus intensity × pupil state interaction) are consistent across lenient, primary, and strict quality tiers, this provides evidence that findings are not driven by lapse-like errors or data quality artifacts. This strategy is consistent with psychophysical best practices for handling stimulus-independent errors without explicitly estimating lapse parameters in the primary model (Prins et al., 2016).

2.3.4 Confound Mitigation: Motor/Response Screen Contamination

The cognitive AUC window faces unavoidable temporal overlap with multiple task events that could contaminate pupil measurements:

  1. Target stimulus onset: 4.35s (relative to squeeze onset)
  2. Response screen appears: 4.70s (only 350ms after target)
  3. Button press: Variable (4.70s + RT, typically 5.3-5.4s for median RTs of 0.6-0.7s)

The goal is not to “eliminate confounds” (which is impossible given the temporal overlap), but to minimize contamination, model what cannot be avoided, and demonstrate robustness through multiple sensitivity checks.

Mitigation Strategies Implemented:

  1. Stimulus-locked fixed window: The primary cognitive window (4.85-6.05s) is stimulus-locked and fixed in duration, which reduces mechanical RT-AUC coupling confounds. By using mean dilation (AUC/duration) rather than raw AUC, we further remove duration-related artifacts.

  2. RT as covariate: Reaction time is included as a covariate in all models to control for decision state and reduce RT-pupil coupling confounds. This addresses the fact that RT correlates with decision processes and may drive apparent pupil-behavior relationships.

  3. Motor buffer truncation (available for future sensitivity analyses): Window definitions are computed to truncate the primary window 150ms before button press (cog_win_primary_end_motorbuffered), avoiding motor execution/movement/blink contamination. Full AUC computation for motor-buffered windows requires re-processing flat files with RT information and is available for future sensitivity analyses.

  4. Pre-response window (available for future sensitivity analyses): Decision-aligned window definitions are computed that exclude motor execution entirely (500ms before response, excluding last 150ms). This captures decision-related arousal without motor contamination. Full AUC computation for pre-response windows requires re-processing and is available for future sensitivity analyses.

  5. Slow-RT sensitivity subset: A flag (cog_win_uncontaminated_by_motor) identifies trials on which motor activity cannot contaminate the primary window (RT > 1.5s, such that the button press occurs after 6.20s, beyond the 6.05s window end). This subset can be used for sensitivity analyses to demonstrate that findings are not driven by motor contamination.

These mitigation strategies are documented and available in the data, with full AUC computation for motor-buffered and pre-response windows available for future sensitivity analyses when needed.
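
The timing arithmetic behind the motor buffer and the slow-RT flag can be sketched as follows. The constants come from the text; the function name, the return format, and the clamping of the truncated end point to the window bounds are assumptions of this sketch rather than a description of the actual pipeline.

```python
from typing import Tuple

RESPONSE_SCREEN_S = 4.70  # response screen onset relative to squeeze onset
COG_START_S = 4.85        # primary cognitive window start
COG_END_S = 6.05          # primary cognitive window end
MOTOR_BUFFER_S = 0.150    # truncate this long before the button press
SLOW_RT_S = 1.5           # RT cutoff for the uncontaminated subset


def motor_window_info(rt_s: float) -> Tuple[float, float, bool]:
    """Return (button_press_time, motor_buffered_window_end, uncontaminated_flag)."""
    press_s = RESPONSE_SCREEN_S + rt_s
    # End the window 150ms before the press, but never outside the primary
    # window (the clamping rule is an assumption of this sketch).
    buffered_end_s = max(COG_START_S, min(COG_END_S, press_s - MOTOR_BUFFER_S))
    # Analogue of cog_win_uncontaminated_by_motor: press occurs after 6.20s.
    uncontaminated = rt_s > SLOW_RT_S
    return press_s, buffered_end_s, uncontaminated
```

For a median-range RT of 0.65s the press falls at about 5.35s and the buffered window would end near 5.20s, so most of the primary window is lost on typical trials. This is why the motor-buffered window is treated as a sensitivity analysis rather than as the primary metric.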

2.4 Locus Coeruleus (LC) Integrity

2.4.1 LC Integrity Quantification

For a subset of participants, structural magnetic resonance imaging (MRI) data were available to quantify LC integrity. LC integrity was assessed using contrast-based metrics derived from magnetization transfer contrast (MTC) imaging, following the methodology described in Bennett et al. (2024, Scientific Reports). The primary LC integrity metric was computed as the mean MTC signal within the LC mask relative to the pons reference region (MTCLC/pons). This mean-within-mask approach was chosen over maximum-voxel extraction because it is more robust to uneven LC degeneration patterns that can affect maximum-voxel metrics (Bennett et al., 2024). Additionally, diffusion-based metrics (e.g., restricted diffusion fraction) were available as complementary integrity indices, with prior work suggesting that diffusion metrics may relate to cognitive performance independent of age (Bennett et al., 2024).

2.4.2 Conceptual Relevance

LC structural integrity provides a trait-level index of neuromodulatory capacity that complements the state-level (trial-wise) and trait-level (between-subject mean) pupil metrics. The LC–NE system serves as the primary source of norepinephrine to the forebrain and modulates cortical gain according to Adaptive Gain Theory (Aston-Jones & Cohen, 2005). Structural integrity of the LC may constrain both baseline (tonic) and task-evoked (phasic) LC–NE function, potentially explaining individual differences in arousal reactivity and its relationship to cognitive performance. In the context of Chapter 2’s focus on pupil-indexed arousal and psychometric sensitivity, LC integrity may help explain why some individuals show stronger or weaker coupling between arousal state and perceptual sensitivity, beyond what is captured by pupil trait (mean arousal level).

2.4.3 LC Integrity Analyses: Exploratory Extension

LC integrity analyses are positioned as exploratory and secondary to the primary pupil–psychometric coupling analyses specified in the prospectus. These analyses were motivated by the theoretical framework’s emphasis on LC–NE mechanisms and by the recognition that trait-level pupil metrics (mean arousal) may be confounded with underlying LC structural integrity or other individual-difference factors (e.g., cognitive reserve, general health). However, because LC integrity data were not part of the original prospectus and are available for only a subset of participants, these analyses are treated as exploratory extensions that may inform future research rather than as confirmatory tests of primary hypotheses.

2.5 Statistical Analysis

2.5.1 Psychometric Function Fitting

Psychometric function parameters (thresholds and slopes) used as the behavioral backbone were estimated using Psignifit 4 (MATLAB; Schütt et al., 2016), which supports robust estimation and can account for overdispersion. A Weibull psychometric function was fit to the proportion of “different” responses across continuous stimulus intensity levels, separately for each participant × task/modality × effort condition. Because the Weibull function is undefined at zero stimulus intensity, the 0-offset level was replaced with half of the smallest non-zero stimulus intensity. Guess and lapse parameters were allowed to vary freely during fitting (i.e., \(\gamma\) and \(\lambda\) were free parameters), consistent with psychophysical guidance that fixing lapse at zero can bias threshold and slope estimates when lapses are present (Prins et al., 2016; Wichmann & Hill, 2001). Threshold and slope were derived at the effective midpoint between the estimated guess rate \(\gamma\) and \(1-\lambda\) (i.e., at \(\gamma + (1-\gamma-\lambda)/2\), which corresponds to 0.5 on the response probability scale once it is rescaled between the lower and upper asymptotes). This midpoint-based definition places the threshold at the stimulus level where “different” responses fall halfway between the lower and upper asymptotes. The criterion is therefore robust to variations in guess and lapse rates and comparable across PF families and parameterizations.
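
The midpoint definition and the zero-intensity substitution can be made concrete with a short Python sketch. It uses a generic textbook Weibull, \(F(x) = 1 - \exp(-(x/\alpha)^k)\), which is an assumption for illustration: Psignifit 4 uses its own parameterization, and the function and parameter names here are hypothetical.

```python
import math
from typing import List


def weibull_pf(x: float, alpha: float, k: float, gamma: float, lam: float) -> float:
    """P("different") under a generic Weibull with guess rate gamma and lapse rate lam."""
    core = 1.0 - math.exp(-((x / alpha) ** k))
    return gamma + (1.0 - gamma - lam) * core


def midpoint_threshold(alpha: float, k: float) -> float:
    """Stimulus level where the core function equals 0.5.

    At this level the full PF equals gamma + (1 - gamma - lam) / 2,
    whatever the values of gamma and lam.
    """
    return alpha * math.log(2.0) ** (1.0 / k)


def replace_zero_intensity(levels: List[float]) -> List[float]:
    """Replace 0-offset levels with half the smallest non-zero level."""
    smallest = min(v for v in levels if v > 0)
    return [v if v > 0 else smallest / 2 for v in levels]
```

With the auditory offsets, `replace_zero_intensity([0, 8, 16, 32, 64])` maps the 0 Hz level to 4 Hz. Because `midpoint_threshold` depends only on the core parameters, two fits with different guess or lapse rates but the same core function share the same threshold.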

2.5.2 Model Checking and Goodness-of-Fit

Likelihood-based psychometric function modeling typically evaluates fit using likelihood-ratio/deviance approaches, and PF-based inference relies on assumptions of stability and independence across trials (Prins et al., 2016). In the present hierarchical GLMM framework, model selection and goodness-of-fit are assessed through several complementary approaches: (a) AIC comparisons between models with and without key interaction terms (e.g., stimulus intensity × pupil state), where lower AIC values indicate better fit; (b) robustness checks across multiple quality tiers (lenient, primary, strict), testing whether key effects are stable across different data inclusion criteria; and (c) convergence diagnostics for mixed-effects models (e.g., gradient checks, optimizer settings). The hierarchical structure accounts for within-subject dependencies through random effects, and the continuous-intensity GLMM avoids the trial-binning assumptions that can affect traditional PF fitting. While parametric or nonparametric bootstrap approaches can provide additional uncertainty quantification for PF parameters (Prins et al., 2016), the present GLMM framework uses likelihood-based confidence intervals and robustness checks across quality tiers as the primary approach to evaluating parameter stability and inference validity.
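
The AIC comparison in (a) reduces to simple arithmetic once each fitted model reports its log-likelihood and parameter count. A minimal Python sketch follows; the function names are hypothetical, and the log-likelihoods in the usage note are invented placeholders, not results from this study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(
    loglik_without: float, k_without: int, loglik_with: float, k_with: int
) -> float:
    """AIC(without interaction) - AIC(with interaction).

    Positive values favour the model that includes the interaction term.
    """
    return aic(loglik_without, k_without) - aic(loglik_with, k_with)
```

For example, `delta_aic(-3000.0, 10, -2999.5, 11)` returns -1.0: adding one parameter costs 2 AIC units, and a log-likelihood gain of 0.5 recovers only 1 of them, so the simpler model would be preferred in that hypothetical case.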

2.5.3 Primary GLMM

The primary model was a probit-link GLMM relating phasic arousal to psychometric sensitivity:

\[ \begin{aligned} \text{probit}(P(Y_{ij}=1)) &= \beta_0 + \beta_1 X_{ij} + \beta_2 \text{Effort}_{ij} + \beta_3 \text{Modality}_{ij} + \beta_4 P^{(\text{state})}_{ij} \\ &\quad + \beta_5 \text{RT}_{ij} + \beta_6 (X_{ij} \times P^{(\text{state})}_{ij}) + \beta_7 P^{(\text{trait})}_{j} + u_{0j} + u_{1j}X_{ij} \end{aligned} \]

where \(Y_{ij}\) is the binary choice on trial \(i\) for subject \(j\), \(X_{ij}\) is continuous stimulus intensity, \(P^{(\text{state})}_{ij}\) is the within-subject centered pupil metric (mean dilation, trial value - subject mean), and \(P^{(\text{trait})}_{j}\) is the between-subject pupil metric (subject mean). The term \(\beta_5 \text{RT}_{ij}\) includes reaction time as a covariate to control for decision state and reduce RT-pupil coupling confounds. The key interaction term \(X_{ij} \times P^{(\text{state})}_{ij}\) (coefficient \(\beta_6\)) tests whether within-person fluctuations in arousal are associated with changes in psychometric sensitivity. The term \(\beta_7 P^{(\text{trait})}_{j}\) tests whether between-subject differences in average arousal relate to sensitivity, while controlling for state-level effects. The random effects \(u_{0j}\) and \(u_{1j}\) allow each participant to have their own baseline performance level and intensity sensitivity, respectively.
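
The state/trait decomposition described above (trial value minus subject mean, and the subject mean itself) can be sketched in a few lines of Python. The function name and the tuple-based input format are illustrative assumptions, and any additional scaling applied before modeling is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def decompose_pupil(
    trials: List[Tuple[str, float]]
) -> List[Tuple[str, float, float]]:
    """Split each trial's pupil value into within-subject state and subject-level trait.

    Input: (subject_id, cog_mean) per trial.
    Output: (subject_id, state, trait) per trial, in the same order.
    """
    by_subject: Dict[str, List[float]] = defaultdict(list)
    for subject, value in trials:
        by_subject[subject].append(value)
    # Trait: each subject's mean across their trials.
    trait = {s: sum(v) / len(v) for s, v in by_subject.items()}
    # State: deviation of each trial from that subject's own mean.
    return [(s, value - trait[s], trait[s]) for s, value in trials]
```

By construction the state values sum to zero within each subject, so the state term carries only within-person variation and the trait term carries only between-person variation.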

2.5.4 Sensitivity Analysis and Robustness Checks

To ensure the robustness of our primary findings, we conducted several sensitivity analyses:

1. Quality Tier Robustness Checks

The primary analysis used the primary quality tier (B1 baseline quality ≥ 0.50, cognitive window quality ≥ 0.60, maximum gap ≤ 250ms, window duration ≥ 0.90s, valid samples ≥ 240). To assess whether results were sensitive to data quality thresholds, we re-ran the primary GLMM using two alternative quality tiers:

  • Lenient tier: B1 baseline quality ≥ 0.50, cognitive window quality ≥ 0.50, maximum gap ≤ 250ms, window duration ≥ 0.75s (includes more trials, potentially with lower data quality)
  • Strict tier: B1 baseline quality ≥ 0.50, cognitive window quality ≥ 0.70, maximum gap ≤ 250ms, window duration ≥ 0.90s, valid samples ≥ 240 (includes fewer trials, but with higher data quality)

If the key interaction effect (stimulus intensity × pupil state) is consistent across all three quality tiers, this provides evidence that the finding is robust to different data inclusion criteria.

2. Slow-RT Sensitivity Subset

To assess whether findings are robust to motor contamination concerns, we conducted a sensitivity analysis using only trials on which motor activity cannot contaminate the primary window (cog_win_uncontaminated_by_motor == TRUE, i.e., RT > 1.5s such that the button press occurs after 6.20s). If the key interaction effect is consistent in this slow-RT subset, this provides evidence that findings are not driven by motor contamination.

3. Model Comparison

To evaluate whether the interaction term meaningfully improves model fit, we compared the primary model (with the stimulus intensity × pupil state interaction) to an alternative model that excluded the interaction term. Models were compared using Akaike Information Criterion (AIC), where lower AIC values indicate better model fit. A substantial improvement in fit (ΔAIC > 2) for the model with the interaction would support the inclusion of this term and the interpretation that pupil state modulates psychometric sensitivity.

4. Missingness Diagnostic

As described in the Missingness Diagnostic section, we tested whether missing pupil data were systematically related to experimental conditions (effort, stimulus intensity, task) or behavioral measures (response time). If missingness is random or only related to non-experimental variables, this supports the validity of the primary analyses.

2.5.5 Exploratory LC Integrity Extension

For participants with available LC integrity data, we extended the primary GLMM to examine whether LC structural integrity moderates the relationship between pupil-indexed arousal and psychometric sensitivity. The extended model includes:

\[ \begin{aligned} \text{probit}(P(Y_{ij}=1)) &= \beta_0 + \beta_1 X_{ij} + \beta_2 \text{Effort}_{ij} + \beta_3 \text{Modality}_{ij} + \beta_4 P^{(\text{state})}_{ij} \\ &\quad + \beta_5 \text{RT}_{ij} + \beta_6 (X_{ij} \times P^{(\text{state})}_{ij}) + \beta_7 P^{(\text{trait})}_{j} + \beta_8 \text{LC}_j \\ &\quad + \beta_9 (X_{ij} \times P^{(\text{state})}_{ij} \times \text{LC}_j) + \beta_{10} \text{Age}_j + \beta_{11} \text{Sex}_j \\ &\quad + \beta_{12} \text{Education}_j + u_{0j} + u_{1j}X_{ij} \end{aligned} \]

where \(\text{LC}_j\) is the LC integrity metric (mean MTCLC/pons) for subject \(j\), and the three-way interaction \(X_{ij} \times P^{(\text{state})}_{ij} \times \text{LC}_j\) tests whether LC integrity moderates the coupling between pupil state and psychometric sensitivity. Age, sex, and education are included as covariates to control for potential confounds. The random effects structure remains consistent with the primary model. These analyses are exploratory given that (a) LC integrity data were not part of the original prospectus, (b) LC data are available for only a subset of participants, and (c) the primary focus remains on pupil–psychometric coupling rather than LC integrity per se.

2.5.6 Analysis Plan: Confirmatory vs. Exploratory

The primary analyses specified in the prospectus are treated as confirmatory:

  • Effort effects on psychometric function parameters (RQ1)
  • Effort–pupil manipulation check (RQ2)
  • Trial-level pupil–psychometric coupling (RQ3, primary analysis)
  • Subject-level PF–pupil coupling (RQ4)

Sensitivity analyses (quality tier robustness checks, model comparison) are pre-specified and serve to evaluate the robustness of primary findings.

Exploratory analyses include:

  • LC integrity moderation analyses (not in original prospectus)
  • Any post-hoc analyses suggested by the data

3 Results

The following sections present results from the complete analysis pipeline, including behavioral psychometric function parameters, effort–pupil manipulation checks, missingness diagnostics, primary pupil–psychometric coupling analyses, and subject-level individual differences. All analyses were conducted using the primary quality tier (B1 baseline quality ≥ 0.50 and cognitive window quality ≥ 0.60, plus gap-aware filters) unless otherwise specified.

3.1 Data Quality and Sample Sizes

Table 1: Data Quality Summary: Pupil Data Availability by Task and Effort

| Task | N Subjects | Total Trials | Primary Tier | Prop Primary | Mean Baseline Quality | Mean Cognitive Quality |
|---|---|---|---|---|---|---|
| ADT | 54 | 6404 | 2900 | 0.440 | 0.499 | 0.549 |
| VDT | 56 | 6311 | 3440 | 0.555 | 0.580 | 0.630 |

3.2 Behavioral Backbone (PF Outcomes)

Table 2: Psychometric Function Parameters by Task and Effort

| Task | Effort | N | Threshold (M) | Threshold (SD) | Slope (M) | Slope (SD) |
|---|---|---|---|---|---|---|
| ADT | High | 51 | 2.128 | 1.126 | 1.536 | 1.776 |
| ADT | Low | 51 | 1.486 | 2.640 | 1.429 | 1.764 |
| VDT | High | 51 | 1.950 | 0.590 | 1.898 | 1.889 |
| VDT | Low | 51 | 1.855 | 0.681 | 1.569 | 1.518 |

Psychometric function parameters (thresholds and slopes) were estimated separately for each subject × task × effort combination using probit link functions with continuous stimulus intensity.

Hypothesis Testing: Hypothesis 1a predicted that midpoint thresholds would be higher under High effort relative to Low effort, which could reflect degraded signal-to-noise ratios (sensitivity changes) or shifts in decision criterion (bias changes), as threshold in same–different paradigms can reflect both. Hypothesis 1b predicted that slopes would be shallower (lower sensitivity, higher \(\beta\)) under High effort, indicating increased variability in perceptual judgments when arousal is elevated. Steepness (inversely related to \(\beta\)) is emphasized as the primary sensitivity index, as it is less confounded by criterion shifts than threshold. These hypotheses were tested by comparing PF parameters between High and Low effort conditions within each task modality.

Figure 3: Psychometric Functions by Effort Condition

3.3 Effort–Pupil Manipulation Check

Table 3: Effort Effect on Pupil Metrics (High vs Low Effort)

| Pupil Metric | Estimate | SE | t |
|---|---|---|---|
| Total AUC | 0.857 | 0.376 | 2.282 |
| Cognitive AUC | -0.012 | 0.003 | -3.859 |

High-effort handgrip increased Total AUC relative to Low effort (estimate = 0.857, SE = 0.376, t = 2.282), consistent with engagement of central arousal systems during the squeeze. The Cognitive AUC estimate, by contrast, was negative (estimate = -0.012, SE = 0.003, t = -3.859), indicating a smaller task-evoked response under High effort. Mixed-effects models tested whether effort condition predicted pupil metrics (Total AUC and Cognitive AUC) while controlling for task modality.

Hypothesis Testing: Hypothesis 2a predicted that baseline (tonic) pupil diameter would be larger under High effort, and Hypothesis 2b predicted that task-evoked pupil responses (AUC) would be larger under High effort. Both hypotheses would be supported if High effort significantly increased Total AUC and Cognitive AUC relative to Low effort, which would provide a physiological manipulation check that physical effort modulates central arousal systems in older adults.

Figure 4: Effort–Pupil Manipulation Check

3.4 Missingness Diagnostic

Missingness analyses tested whether pupil data retention was predicted by effort, stimulus intensity, modality, or response time using logistic mixed-effects models. This diagnostic is critical for assessing potential systematic bias in pupil data availability.

Table 4: Missingness Diagnostic: Predictors of Pupil Data Usability

| Predictor | Estimate (OR) | SE | z | p |
|---|---|---|---|---|
| Effort (High vs Low) | 1.203 | 0.061 | 3.672 | < 0.001 |
| Stimulus Intensity | 1.015 | 0.026 | 0.575 | 0.565 |
| Task (VDT vs ADT) | 1.967 | 0.106 | 12.584 | < 0.001 |
| Response Time | 1.001 | 0.029 | 0.023 | 0.982 |

Figure 5: Missingness Diagnostic: RT by Pupil Data Availability

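Effort (OR = 1.203, z = 3.672, p < 0.001) and task (OR = 1.967, z = 12.584, p < 0.001) significantly predicted pupil data usability: usable data were more likely under High effort and in the VDT. Stimulus intensity (p = 0.565) and response time (p = 0.982) did not predict usability. Because missingness is systematically related to two experimental factors, this potential bias should be acknowledged when interpreting the primary analyses.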

3.5 Pupil–Psychometric Coupling (Primary Analysis)

Table 5: Primary GLMM: Pupil–Psychometric Coupling

| Term | Estimate | SE | z | p |
|---|---|---|---|---|
| (Intercept) | -0.1076 | 0.0938 | -1.1468 | 0.2515 |
| stimulus_intensity_scaled | 1.3070 | 0.0834 | 15.6747 | < 0.0001 |
| pupil_cognitive_state_scaled | -0.0457 | 0.0239 | -1.9119 | 0.0559 |
| effort_factorHigh | -0.1391 | 0.0460 | -3.0229 | 0.0025 |
| task_factorVDT | 0.2561 | 0.0554 | 4.6215 | < 0.0001 |
| pupil_cognitive_trait_scaled | -0.0525 | 0.0911 | -0.5770 | 0.5640 |
| stimulus_intensity_scaled:pupil_cognitive_state_scaled | 0.0353 | 0.0304 | 1.1638 | 0.2445 |

The primary analysis tested whether trial-wise phasic arousal (pupil state) predicts psychometric sensitivity when stimulus intensity is modeled continuously. The key test is the interaction between stimulus intensity and pupil cognitive state, which directly assesses whether within-person fluctuations in arousal modulate psychometric sensitivity.
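
To show how the interaction coefficient maps onto psychometric sensitivity, the Python sketch below computes the implied probit slope on stimulus intensity at a given pupil state, using the point estimates from Table 5. It is purely illustrative: the function names are hypothetical, the prediction holds effort and task at their reference levels with pupil trait at zero, and no uncertainty is propagated.

```python
from statistics import NormalDist

# Point estimates copied from Table 5 (scaled predictors).
B_INTERCEPT = -0.1076
B_INTENSITY = 1.3070
B_STATE = -0.0457
B_INTERACTION = 0.0353


def intensity_slope(pupil_state_scaled: float) -> float:
    """Probit slope on stimulus intensity at a given pupil state."""
    return B_INTENSITY + B_INTERACTION * pupil_state_scaled


def predicted_p_different(intensity_scaled: float, pupil_state_scaled: float) -> float:
    """P("different") at reference effort and task levels, with trait = 0."""
    eta = (
        B_INTERCEPT
        + B_STATE * pupil_state_scaled
        + intensity_slope(pupil_state_scaled) * intensity_scaled
    )
    # The probit link maps the linear predictor through the standard normal CDF.
    return NormalDist().cdf(eta)
```

Under these estimates, moving pupil state from -1 to +1 (in scaled units) changes the implied slope from 1.3070 - 0.0353 to 1.3070 + 0.0353, a shift that is small relative to the standard error of the interaction term.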

Hypothesis Testing: Hypothesis 3a predicted that the interaction between stimulus intensity and pupil state (\(X_{ij} \times P^{(\text{state})}_{ij}\)) would be negative, indicating that higher trial-level arousal is associated with shallower psychometric slopes (reduced sensitivity), consistent with supra-optimal arousal degrading signal quality. The alternative Hypothesis 3b allowed for a positive interaction if moderate arousal enhances sensitivity. Hypothesis 3c predicted minimal effects of pupil trait (between-subject baseline arousal) on sensitivity, as trait effects may be confounded with other individual differences.

Interpretation of Results: If the interaction term is non-significant, this suggests that within-person fluctuations in arousal (pupil state) do not reliably modulate psychometric sensitivity beyond the effects of effort condition. This could indicate that (a) effort main effects (captured by the effort factor in the model) are the primary drivers of sensitivity changes, potentially reflecting resource competition rather than moment-to-moment arousal fluctuations, (b) the relationship between arousal and sensitivity is weaker than expected in older adults, or (c) the fixed-window cognitive pupil metric may not capture the most relevant arousal dynamics for psychometric sensitivity. However, a non-significant interaction does not negate the possibility that effort-induced arousal changes affect performance at the group level (via effort main effects); it suggests that trial-wise arousal fluctuations may not be the primary mechanism. The robustness checks across quality tiers and model comparison help evaluate whether this pattern is consistent and whether the interaction term meaningfully improves model fit.

Figure 6: Psychometric Functions by Pupil State (Primary Analysis)
Figure 7: Model Predictions: Stimulus × Pupil State Interaction

3.5.1 Robustness Checks and Sensitivity Analyses

To assess the robustness of our primary findings, we conducted several sensitivity analyses:

1. Quality Tier Robustness Checks

Results were tested across multiple quality tiers (lenient ≥0.50, primary ≥0.60, strict ≥0.70) to ensure robustness to different data inclusion criteria. If the key interaction effect is consistent across all three quality tiers, this provides evidence that the finding is robust to different data quality thresholds.

Table 6: Robustness Checks: Interaction Effect Across Quality Tiers

| Quality Tier | Estimate | SE | z | p |
|---|---|---|---|---|
| Lenient (≥0.50) | 0.0644 | 0.0277 | 2.3273 | 0.0200 |
| Primary (≥0.60) | 0.0353 | 0.0304 | 1.1638 | 0.2445 |
| Strict (≥0.70) | 0.0390 | 0.0321 | 1.2130 | 0.2251 |

2. Model Comparison

To evaluate whether the interaction term meaningfully improves model fit, we compared the primary model (with stimulus intensity × pupil state interaction) to an alternative model without the interaction term. Models were compared using Akaike Information Criterion (AIC), where lower AIC values indicate better fit.

Table 7: Model Comparison: With vs. Without Interaction Term

**Interpretation:** The model without the interaction term shows better fit, suggesting the interaction may not meaningfully improve model fit.

3. Missingness Diagnostic

As reported in the Missingness Diagnostic section above, we tested whether missing pupil data were systematically related to experimental conditions (effort, stimulus intensity, task, response time). If missingness is random or only related to non-experimental variables (e.g., individual differences in eye-tracking quality), this supports the validity of the primary analyses. If missingness is systematically related to experimental conditions, this could introduce bias and should be acknowledged in interpretation, with robustness checks helping to evaluate whether findings are sensitive to missingness patterns.

3.6 Subject-Level PF–Pupil Coupling

Table 8: Subject-Level Correlations: ΔPupil vs ΔPF Parameters

| Pupil Metric | PF Parameter | r | CI Lower | CI Upper | p | N |
|---|---|---|---|---|---|---|
| Cognitive AUC | Threshold | -0.200 | -0.406 | 0.025 | 0.081 | 77 |
| Cognitive AUC | Slope | 0.149 | -0.077 | 0.361 | 0.195 | 77 |
| Total AUC | Threshold | -0.212 | -0.416 | 0.013 | 0.064 | 77 |
| Total AUC | Slope | -0.054 | -0.275 | 0.172 | 0.639 | 77 |

Subject-level changes in pupil metrics (High–Low effort) were correlated with subject-level changes in PF parameters (High–Low effort). This analysis tests whether individuals who show stronger physiological responses to effort also show larger behavioral sensitivity changes.

Hypothesis Testing: Hypothesis 4 predicted that subject-level changes in pupil metrics (Δpupil = High effort − Low effort) would be positively correlated with changes in thresholds (Δthreshold) and negatively correlated with changes in slopes (Δslope), indicating that individuals with stronger effort-evoked arousal responses also show larger behavioral sensitivity decrements. This hypothesis was tested using Pearson correlations between Δpupil and ΔPF parameters, separately for each task modality.
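
For readers who want to see how a Pearson correlation and its confidence interval are obtained, the following Python sketch uses the standard Fisher z-transform. Whether the reported intervals were computed this way is an assumption, and the function names are illustrative.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between paired values (e.g., delta-pupil and delta-threshold)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r: float, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Confidence interval for r via the Fisher z-transform."""
    z = math.atanh(r)                 # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)
```

Calling `fisher_ci(-0.200, 77)` gives an interval close to the one reported in Table 8 for Cognitive AUC and threshold, which suggests, but does not confirm, that the reported intervals rest on the same transform.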

Figure 8: Subject-Level PF–Pupil Coupling
Figure 9: Correlation Matrix: Subject-Level Changes

3.7 Exploratory LC Integrity Analyses

Table 9: LC Integrity Subset: Comparison of Participants With vs. Without LC Data
*LC subset comparison not yet available.*

**Status:** LC integrity data are currently being added to the master spreadsheet. Once the data are available and the analysis script has been run, this table will display comparisons between participants with and without LC integrity data.

LC Integrity as Moderator of Pupil–Psychometric Coupling

Table 10: Exploratory LC Integrity Extension: Three-Way Interaction
*LC integrity extension analyses not yet available.*

**Status:** LC integrity data are currently being added to the master spreadsheet for subjects BAP 191-202 (excluding 193, 198). Once the LC integrity scores are available in the master spreadsheet, the data should be extracted to `data/processed/ch2_lc_integrity.csv` and the analysis script `04_pupil_psychometric_coupling/06b_lc_integrity_extension.R` should be run to generate the results. Behavioral and pupil data for these subjects are expected to already be available in the merged dataset.

3.8 Summary of Hypothesis Testing


**Note:** This table will be updated once all analyses are complete. The 'Support Status' column will indicate whether each hypothesis was supported, not supported, or partially supported based on the statistical results.

Table 11: Summary of Hypothesis Testing Results

| Hypothesis | Support Status | Key Evidence |
|---|---|---|
| H1a: Thresholds higher under High effort | To be evaluated from PF parameters | PF threshold comparison (High vs Low effort) |
| H1b: Slopes shallower under High effort | To be evaluated from PF parameters | PF slope comparison (High vs Low effort) |
| H2a: Baseline pupil larger under High effort | To be evaluated from effort-pupil check | Total AUC and Cognitive AUC by effort |
| H2b: Task-evoked pupil larger under High effort | To be evaluated from effort-pupil check | Total AUC and Cognitive AUC by effort |
| H3a: Negative interaction (stimulus × pupil state) | To be evaluated from primary GLMM | Interaction term: stimulus_intensity × pupil_state |
| H3b: Positive interaction (alternative) | To be evaluated from primary GLMM | Interaction term: stimulus_intensity × pupil_state |
| H3c: Minimal pupil trait effects | To be evaluated from primary GLMM | Pupil trait main effect |
| H4: ΔPupil correlated with ΔPF parameters | To be evaluated from subject-level correlations | Correlation: Δpupil vs Δthreshold, Δslope |

4 Discussion

4.1 Integration with Theoretical Frameworks

The present findings contribute to understanding how physical effort and arousal relate to perceptual sensitivity in older adults. The results are interpreted in light of several theoretical frameworks introduced earlier: the Yerkes–Dodson law (inverted-U relationship between arousal and performance), Adaptive Gain Theory (optimal LC–NE balance for performance), and resource competition models (limited capacity for dual-task performance).

4.1.1 Arousal–Performance Relationships

If the primary interaction (stimulus intensity × pupil state) is significant and negative, this would support the hypothesis that higher trial-level arousal is associated with reduced psychometric sensitivity, consistent with supra-optimal arousal degrading signal quality in older adults. This pattern would align with the Yerkes–Dodson framework’s prediction that older adults may be more easily pushed onto the descending limb of the arousal–performance curve (Huang & Clewett, 2024; Mather & Harley, 2016). Alternatively, if the interaction is non-significant, this suggests that moment-to-moment arousal fluctuations may not be the primary mechanism linking effort to sensitivity changes. In this case, effort main effects (which may reflect both resource competition and group-level arousal changes) may be the dominant drivers of performance differences.

4.1.2 Resource Competition vs. Arousal Mechanisms

The chapter’s focus on pupil-indexed arousal does not preclude resource competition as a contributing mechanism. Effort main effects in the GLMM capture overall High vs. Low effort differences, which may reflect both generic dual-task interference and arousal-mediated changes. If trial-wise pupil state does not significantly predict sensitivity beyond effort condition, this would suggest that resource competition or effort-induced group-level arousal changes (rather than moment-to-moment fluctuations) are the primary drivers. The psychometric function framework allows us to distinguish sensitivity changes (signal quality) from criterion shifts (response strategy), providing insight into whether effort effects reflect altered evidence quality (consistent with arousal effects on neural gain) versus strategic adjustments (consistent with resource allocation).

4.1.3 State vs. Trait Arousal Effects

The decomposition of pupil metrics into state (within-person) and trait (between-person) components allows us to test distinct hypotheses about how arousal relates to sensitivity. If pupil state significantly predicts sensitivity (via the interaction), this supports the hypothesis that moment-to-moment arousal fluctuations modulate perceptual processing. If pupil trait shows minimal effects (as predicted by Hypothesis 3c), this suggests that stable individual differences in average arousal may be confounded with other factors (e.g., LC integrity, cognitive reserve, general health) that are better captured by direct measures of those constructs.
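The state/trait decomposition described above can be sketched as within-person centering. The following is a minimal illustration rather than the analysis code used in this chapter; the field names `subject` and `pupil`, the function name, and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def decompose_pupil(trials):
    """Split each trial's pupil value into trait and state components.

    trials: list of dicts with hypothetical keys 'subject' and 'pupil'.
    pupil_trait = subject mean minus the grand mean of subject means.
    pupil_state = trial value minus that subject's own mean.
    """
    by_subject = defaultdict(list)
    for t in trials:
        by_subject[t["subject"]].append(t["pupil"])

    subject_mean = {s: mean(v) for s, v in by_subject.items()}
    grand_mean = mean(subject_mean.values())

    out = []
    for t in trials:
        m = subject_mean[t["subject"]]
        out.append({**t,
                    "pupil_state": t["pupil"] - m,
                    "pupil_trait": m - grand_mean})
    return out


# Made-up values for two participants, two trials each.
example = [
    {"subject": "s1", "pupil": 0.10},
    {"subject": "s1", "pupil": 0.30},
    {"subject": "s2", "pupil": 0.50},
    {"subject": "s2", "pupil": 0.70},
]
decomposed = decompose_pupil(example)
```

Because the state component sums to zero within each participant, it carries no between-person information, which is what allows the state and trait terms to be interpreted separately.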

4.2 Implications for Aging and Arousal

The findings have implications for understanding how aging modulates the relationship between physiological arousal and cognitive performance. If older adults show weaker or absent coupling between trial-wise arousal and sensitivity (compared to younger adults in Chapter 1), this could indicate age-related changes in how the LC–NE system modulates cortical gain. Alternatively, if effort main effects are robust but trial-wise coupling is weak, this suggests that older adults may be more sensitive to effort-induced arousal at the group level but less sensitive to moment-to-moment fluctuations, potentially reflecting reduced phasic LC responsiveness or altered gain modulation.

4.3 Limitations

Several limitations should be considered when interpreting the findings:

4.3.1 Pupil Data Quality Constraints

Pupil data quality varies across older adults due to factors such as eye-tracking challenges, increased blink rates, and age-related changes in pupil dynamics. The quality tier system (primary: B1 quality ≥0.50, cognitive quality ≥0.60, plus gap-aware filters; lenient and strict variants) addresses this by testing robustness across different inclusion criteria. Gap-aware quality control metrics (maximum gap size, window duration, valid sample count) help distinguish genuine low dilation from data quality artifacts. However, even with quality filtering, some participants may have limited usable pupil data, potentially reducing power for detecting state-level effects.
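As an illustration of how the primary tier could be applied trial by trial, the sketch below uses the two thresholds stated above (B1 quality ≥ 0.50, cognitive quality ≥ 0.60). The field names, the function name, and the optional `max_gap_s` argument are hypothetical; the lenient and strict thresholds and the actual gap-aware cutoffs are not given here and are therefore left out rather than guessed.

```python
# Primary-tier thresholds as stated in the text.
PRIMARY_TIER = {"b1_quality_min": 0.50, "cog_quality_min": 0.60}


def passes_tier(trial, tier, max_gap_s=None):
    """Return True if a trial meets the tier's quality thresholds.

    trial: dict with hypothetical keys 'b1_quality', 'cog_quality',
           and (only if max_gap_s is given) 'max_gap_s'.
    max_gap_s: optional gap-aware cutoff; an assumed parameter, since
               the actual cutoff is not specified in this section.
    """
    ok = (trial["b1_quality"] >= tier["b1_quality_min"]
          and trial["cog_quality"] >= tier["cog_quality_min"])
    if max_gap_s is not None:
        ok = ok and trial["max_gap_s"] <= max_gap_s
    return ok


# Made-up trials for illustration.
good = {"b1_quality": 0.80, "cog_quality": 0.75, "max_gap_s": 0.10}
poor = {"b1_quality": 0.80, "cog_quality": 0.55, "max_gap_s": 0.10}
kept = [t for t in (good, poor) if passes_tier(t, PRIMARY_TIER)]
```

Keeping the thresholds in a single dictionary makes it straightforward to rerun the same filter with alternative tier definitions when checking robustness.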

4.3.2 Missingness Patterns

The missingness diagnostic tests whether pupil data loss is systematic (e.g., related to effort, stimulus intensity, or task). If missingness is predicted by experimental conditions, this could introduce bias. The robustness checks across quality tiers help evaluate whether findings are sensitive to missingness patterns, but systematic missingness remains a potential limitation.
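A descriptive first step toward such a diagnostic is to tabulate the proportion of trials without usable pupil data at each level of a design factor. The sketch below does only this tabulation and is not the inferential test itself; the field names and example values are hypothetical, and a missing pupil value is assumed to be coded as `None`.

```python
from collections import defaultdict


def missing_rate_by(trials, factor):
    """Proportion of trials with missing pupil data per level of `factor`.

    trials: list of dicts; 'pupil' is None when no usable data exist
            (an assumed coding convention).
    factor: hypothetical key naming a design factor, e.g. 'effort'.
    """
    counts = defaultdict(lambda: [0, 0])  # level -> [n_missing, n_total]
    for t in trials:
        counts[t[factor]][0] += 1 if t["pupil"] is None else 0
        counts[t[factor]][1] += 1
    return {level: n_missing / n_total
            for level, (n_missing, n_total) in counts.items()}


# Made-up trials for illustration.
example = [
    {"effort": "Low", "pupil": 0.12},
    {"effort": "Low", "pupil": None},
    {"effort": "Low", "pupil": 0.08},
    {"effort": "Low", "pupil": 0.15},
    {"effort": "High", "pupil": None},
    {"effort": "High", "pupil": None},
    {"effort": "High", "pupil": 0.20},
    {"effort": "High", "pupil": 0.18},
]
rates = missing_rate_by(example, "effort")
```

Markedly different rates across levels would flag a condition for which the retained trials may not be representative.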

4.3.3 Sample Size Considerations

The effective sample size for detecting trial-level interactions depends on both the number of participants and the number of usable trials per participant. If many participants have sparse pupil data, power to detect state-level effects may be limited. The model comparison (AIC) and robustness checks help gauge whether null findings are stable across specifications, but they cannot by themselves distinguish a true absence of effects from insufficient power.
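For reference, the AIC comparison reduces to a simple calculation on each model's maximized log-likelihood and parameter count, using the standard definition AIC = 2k − 2 ln L. The log-likelihoods and parameter counts below are made-up numbers chosen only to show the arithmetic.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical values: a base model, and a model that adds one
# parameter (e.g. the stimulus intensity x pupil state interaction).
aic_base = aic(log_likelihood=-1000.0, n_params=8)
aic_full = aic(log_likelihood=-998.5, n_params=9)

# Negative delta favors the fuller model; small magnitudes indicate
# that the two models are close to equivalent in support.
delta_aic = aic_full - aic_base
```

In this made-up case the extra parameter improves the log-likelihood by 1.5 but costs a penalty of 2 on the AIC scale, leaving a difference of only one AIC unit, which would usually be read as weak support for the added term.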

4.3.4 LC Integrity Subset Limitations

LC integrity data are available for only a subset of participants. If participants with LC data differ systematically from those without (e.g., in age, cognitive function, or behavioral performance), this limits the generalizability of LC integrity findings. The LC subset bias check addresses this, but the exploratory LC analyses should be interpreted with caution given the subset limitation.

4.3.5 Temporal Overlap and Confound Mitigation

The primary cognitive pupil metric uses a fixed 1.20-second stimulus-locked window (4.85–6.05 s) to reduce RT confounding and capture the TEPR peak. However, this window unavoidably overlaps with multiple task events: the response screen appears at 4.70 s (only 350 ms after target onset at 4.35 s), and button presses occur at variable times (typically 5.3–5.4 s for median RTs). This temporal overlap creates potential confounds from visual/attentional responses to the response screen and motor/blink artifacts from button presses.
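The extent of the overlap follows directly from the event times given above. The short calculation below uses only those times; the constant and function names are hypothetical, and 5.35 s is taken as the midpoint of the quoted 5.3–5.4 s range for a typical button press.

```python
# Event times in seconds on the trial clock, as stated in the text.
TARGET_ONSET_S = 4.35
RESPONSE_SCREEN_S = 4.70
WINDOW_START_S = 4.85
WINDOW_END_S = 6.05


def lag_ms(later_s, earlier_s):
    """Lag between two events, rounded to whole milliseconds."""
    return round((later_s - earlier_s) * 1000)


def fraction_of_window_after(event_s):
    """Fraction of the primary window that follows a given event."""
    clipped = min(max(event_s, WINDOW_START_S), WINDOW_END_S)
    return (WINDOW_END_S - clipped) / (WINDOW_END_S - WINDOW_START_S)


screen_lag_ms = lag_ms(RESPONSE_SCREEN_S, TARGET_ONSET_S)
window_lag_ms = lag_ms(WINDOW_START_S, RESPONSE_SCREEN_S)
window_ms = lag_ms(WINDOW_END_S, WINDOW_START_S)

# Share of the window that lies after the response screen and after
# a typical button press (5.35 s, midpoint of the quoted range).
after_screen = fraction_of_window_after(RESPONSE_SCREEN_S)
after_press = fraction_of_window_after(5.35)
```

The window opens 150 ms after the response screen appears, so the entire window follows that event, and for a press at 5.35 s roughly 58% of the window lies after the motor response.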

Several mitigation strategies are implemented: (a) RT is included as a covariate in all models to control for decision state; (b) mean dilation (AUC/duration) is used rather than raw AUC to remove duration confounds; (c) sensitivity analyses use slow-RT trials in which motor activity cannot contaminate the primary window; and (d) motor-buffered and pre-response window definitions are computed and available for future sensitivity analyses when full AUC computation is completed. However, complete elimination of temporal overlap confounds is impossible given the task design, and the primary window represents a trade-off between capturing relevant arousal dynamics (physiological validity) and minimizing measurement artifacts. Response-locked metrics (available for future sensitivity analyses) may better capture decision-related arousal but are more susceptible to RT confounding.
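Strategies (b) and (c) can be illustrated with a short sketch. It assumes gap-free samples on the trial clock, trapezoidal integration for the AUC, and a definition of "slow-RT" as a button press falling after the end of the primary window; all three are assumptions made for illustration, as are the function and argument names.

```python
def mean_dilation(times_s, pupil, start_s=4.85, end_s=6.05):
    """Mean dilation in a window: trapezoidal AUC divided by duration.

    times_s, pupil: equal-length sequences of sample times (s) and
    baseline-corrected pupil values. Returns None if fewer than two
    samples fall in the window. Assumes no gaps between samples.
    """
    pts = [(t, p) for t, p in zip(times_s, pupil) if start_s <= t <= end_s]
    if len(pts) < 2:
        return None
    auc = sum((t1 - t0) * (p0 + p1) / 2.0
              for (t0, p0), (t1, p1) in zip(pts, pts[1:]))
    duration = pts[-1][0] - pts[0][0]
    return auc / duration


def is_slow_rt(press_time_s, window_end_s=6.05):
    """True if the button press falls after the primary window (assumed rule)."""
    return press_time_s > window_end_s


# Made-up samples: a linear ramp across the full window, and a flat
# trace covering only half of it.
ramp = mean_dilation([4.85, 5.25, 5.65, 6.05], [0.0, 0.1, 0.2, 0.3])
flat_short = mean_dilation([4.85, 5.45], [0.1, 0.1])
```

Dividing by the covered duration means that a flat trace yields the same mean dilation regardless of how much of the window it spans, whereas its raw AUC would scale with duration.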

4.4 Integration with Chapter 1 and Chapter 3

Chapter 2 extends the dual-task paradigm from younger adults (Chapter 1) to older adults, testing whether arousal–sensitivity coupling is preserved or altered with age. If Chapter 1 shows robust coupling in younger adults but Chapter 2 shows weaker or absent coupling in older adults, this would suggest age-related changes in how arousal modulates perceptual sensitivity. Alternatively, if both chapters show similar patterns, this would suggest that the arousal–sensitivity relationship is relatively stable across the adult lifespan, with age effects manifesting primarily in baseline sensitivity or effort reactivity rather than in the coupling mechanism itself.

Chapter 3 will use hierarchical drift diffusion modeling to decompose the behavioral patterns observed in Chapter 2 into latent decision parameters (drift rate, boundary separation, starting point, non-decision time). This will test whether arousal effects are more consistent with altered evidence quality (drift rate changes) versus strategic response adjustments (boundary separation changes), providing a mechanistic account of how physical effort and arousal reshape perceptual decisions in aging.

4.5 Conclusions and Future Directions

Chapter 2 provides a statistically rigorous characterization of how pupil-indexed arousal covaries with psychophysical decision behavior in older adults, using continuous stimulus intensity modeling and within-subject centering to preserve individual arousal granularity. The findings contribute to understanding whether physical effort–induced arousal modulates perceptual sensitivity at the trial level, and whether this relationship is robust across data quality thresholds and model specifications.

Future research directions include: (a) examining whether different pupil window definitions (e.g., motor-buffered windows, pre-response windows, response-locked metrics) reveal stronger coupling when full AUC computation is completed, (b) testing whether LC integrity moderates arousal–sensitivity relationships, (c) comparing arousal–sensitivity coupling between younger and older adults directly, (d) using computational modeling (Chapter 3) to identify the latent decision parameters that implement the observed behavioral patterns, and (e) implementing event-based GLM/deconvolution approaches to directly separate target-driven, response-screen-driven, and motor-driven pupil components.

The work presented here establishes the empirical and methodological foundation for Chapter 3, which will examine whether latent decision parameters provide a coherent mechanistic account of how physical effort and arousal reshape perceptual decisions in aging.

5 References

Alnæs, D., Sneve, M. H., Espeseth, T., Endestad, T., van de Pavert, S. H. P., & Laeng, B. (2014). Pupil size signals mental effort deployed during multiple object tracking and predicts brain activity in the dorsal attention network and the locus coeruleus. Journal of Vision, 14(4), Article 1.
Aston-Jones, G., & Cohen, J. D. (2005). An integrative theory of locus coeruleus–norepinephrine function: Adaptive gain and optimal performance. Annual Review of Neuroscience, 28, 403–450. https://doi.org/10.1146/annurev.neuro.28.061604.135709
Azer, L., Xie, W., Park, H.-B., & Zhang, W. (2023). Detrimental effects of effortful physical exertion on a working memory dual-task in older adults. Psychology and Aging, 38(4), 291–304. https://doi.org/10.1037/pag0000730
Beatty, J. (1982). Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychological Bulletin, 91(2), 276–292. https://doi.org/10.1037/0033-2909.91.2.276
Bennett, I. J., Langley, J., Sun, A., Solis, K., Seitz, A. R., & Hu, X. P. (2024). Locus coeruleus contrast and diffusivity metrics differentially relate to age and memory performance. Scientific Reports, 14(1), 15372.
Beurskens, R., Helmich, I., Rein, R., & Bock, O. (2014). Age-related changes in prefrontal activity during walking in dual-task situations: A fNIRS study. International Journal of Psychophysiology, 92(3), 122–128. https://doi.org/10.1016/j.ijpsycho.2014.03.005
Duchowski, A. T., Biele, C., Niedzielska, A., Krejtz, K., Krejtz, I., Kiefer, P., Raubal, M., & Giannopoulos, I. (2018). The index of pupillary activity: Measuring cognitive load vis-à-vis task difficulty with pupil oscillation. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–13. https://doi.org/10.1145/3173574.3173856
Eldar, E., Cohen, J. D., & Niv, Y. (2013). The effects of neural gain on attention and learning. Nature Neuroscience, 16(8), 1146–1153. https://doi.org/10.1038/nn.3428
Enders, C. K. (2013). Centering predictors and contextual effects. In G. A. Marcoulides & R. E. Schumacker (Eds.), The SAGE handbook of multilevel modeling (pp. 89–108). SAGE Publications. https://doi.org/10.4135/9781446247600.n6
Gilzenrat, M. S., Nieuwenhuis, S., Jepma, M., & Cohen, J. D. (2010). Pupil diameter tracks changes in control state predicted by the adaptive gain theory of locus coeruleus function. Cognitive, Affective, & Behavioral Neuroscience, 10(2), 252–269. https://doi.org/10.3758/CABN.10.2.252
Granholm, E., Holden, J., Link, P. C., McQuaid, J. R., & Jeste, D. V. (2007). Effortful cognitive resource allocation and negative symptom severity in schizophrenia. Schizophrenia Bulletin, 33(3), 831–842. https://doi.org/10.1093/schbul/sbl051
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. Wiley.
Hershman, R., Milshtein, Y., & Henik, A. (2024). Processing and analyzing of pupillometry data. In M. H. Papesh & S. D. Goldinger (Eds.), Modern pupillometry: Cognition, neuroscience, and practical applications. Springer Nature. https://doi.org/10.1007/978-3-031-54896-3
Hoffman, L. (2015). Longitudinal analysis: Modeling within-person fluctuation and change. Routledge. https://doi.org/10.4324/9781315744094
Huang, R., & Clewett, D. (2024). The locus coeruleus: Where cognitive and emotional processing meet the eye. In M. H. Papesh & S. D. Goldinger (Eds.), Modern pupillometry: Cognition, neuroscience, and practical applications (pp. 3–75). Springer Nature. https://doi.org/10.1007/978-3-031-54896-3_1
Jepma, M., Verdonschot, R. G., Van Steenbergen, H., Rombouts, S. A., & Nieuwenhuis, S. (2012). Neural mechanisms underlying the induction and relief of perceptual curiosity. Frontiers in Behavioral Neuroscience, 6, 5.
Joshi, S., Li, Y., Kalwani, R. M., & Gold, J. I. (2016). Relationships between pupil diameter and neuronal activity in the locus coeruleus, colliculi, and cingulate cortex. Neuron, 89(1), 221–234. https://doi.org/10.1016/j.neuron.2015.11.028
Kahneman, D. (1973). Attention and effort. Prentice-Hall.
Kahneman, D., & Beatty, J. (1966). Pupil diameter and load on memory. Science, 154(3756), 1583–1585. https://doi.org/10.1126/science.154.3756.1583
Kret, M. E., & Sjak-Shie, E. E. (2019). Preprocessing pupil size data: Guidelines and code. Behavior Research Methods, 51(3), 1336–1342.
Laeng, B., & Mathôt, S. (2024). Methodological aspects of pupillometry. In M. H. Papesh & S. D. Goldinger (Eds.), Modern pupillometry: Cognition, neuroscience, and practical applications. Springer Nature. https://doi.org/10.1007/978-3-031-54896-3
Lalande, S., Sawicki, C. P., Baker, J. R., & Shoemaker, J. K. (2014). Effect of age on the hemodynamic and sympathetic responses at the onset of isometric handgrip exercise. Journal of Applied Physiology, 116(2), 222–227. https://doi.org/10.1152/japplphysiol.01022.2013
Lee, T.-H., Greening, S. G., Ueno, T., Clewett, D., Ponzio, A., Sakaki, M., & Mather, M. (2018). Arousal increases neural gain via the locus coeruleus-noradrenaline system in younger adults but not in older adults. Nature Human Behaviour, 2(5), 356–366. https://doi.org/10.1038/s41562-018-0344-1
Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user’s guide (2nd ed.). Lawrence Erlbaum Associates.
Mark, A. L., Victor, R. G., Nerhed, C., & Wallin, B. G. (1985). Microneurographic studies of the mechanisms of sympathetic nerve responses to static exercise in humans. Circulation Research, 57(3), 461–469. https://doi.org/10.1161/01.RES.57.3.461
Mather, M., Clewett, D., Sakaki, M., & Harley, C. W. (2016). Norepinephrine ignites local hotspots of neuronal excitation: How arousal amplifies selectivity in perception and memory. Behavioral and Brain Sciences, 39, e200. https://doi.org/10.1017/S0140525X15000667
Mather, M., & Harley, C. W. (2016). The locus coeruleus: Essential for maintaining cognitive function and the aging brain. Trends in Cognitive Sciences, 20(3), 214–226. https://doi.org/10.1016/j.tics.2016.01.001
Mathôt, S., & Vilotijević, A. (2023). Methods in cognitive pupillometry: Design, preprocessing, and statistical analysis. Behavior Research Methods, 55(6), 3055–3077.
McDougal, D. H., & Gamlin, P. D. (2010). The influence of intrinsically photosensitive retinal ganglion cells on the spectral sensitivity and response dynamics of the human pupillary light reflex. Vision Research, 50(1), 72–87. https://doi.org/10.1016/j.visres.2009.10.012
Mikneviciute, G., Ballhausen, N., Rimmele, U., & Kliegel, M. (2022). Does older adults’ cognition particularly suffer from stress? A systematic review of acute stress effects on cognition in older age. Neuroscience & Biobehavioral Reviews, 132, 583–602. https://doi.org/10.1016/j.neubiorev.2021.12.010
Murphy, P. R., O’Connell, R. G., O’Sullivan, R., Robertson, I. H., & Balsters, J. H. (2014). Pupil diameter covaries with BOLD activity in human locus coeruleus. Human Brain Mapping, 35(8), 4140–4154. https://doi.org/10.1002/hbm.22466
Navon, D. (1984). Resources—a theoretical soup stone? Psychological Review, 91(2), 216–234. https://doi.org/10.1037/0033-295X.91.2.216
Navon, D. (1985). Attention division or attention sharing? In M. I. Posner & O. S. M. Marin (Eds.), Attention and performance XI: Mechanisms of attention (pp. 133–146). Lawrence Erlbaum Associates.
Papesh, M. H., & Goldinger, S. D. (2024). Modern pupillometry: Cognition, neuroscience, and practical applications. Springer Nature. https://doi.org/10.1007/978-3-031-54896-3
Pashler, H. (1994). Dual-task interference in simple tasks: Data and theory. Psychological Bulletin, 116(2), 220–244. https://doi.org/10.1037/0033-2909.116.2.220
Preuschoff, K., ’t Hart, B. M., & Einhäuser, W. (2011). Pupil dilation signals surprise: Evidence for noradrenaline’s role in decision making. Frontiers in Neuroscience, 5, 115.
Prins, N. et al. (2016). Psychophysics: A practical introduction. Academic Press.
Proctor, R. W. (2012). Action selection. In Handbook of psychology, second edition. John Wiley & Sons. https://doi.org/10.1002/9781118133880.hop204011
Salthouse, T. A. (1996). The processing-speed theory of adult age differences in cognition. Psychological Review, 103(3), 403–428. https://doi.org/10.1037/0033-295X.103.3.403
Sander, M., Macefield, V. G., & Henderson, L. A. (2010). Cortical and brain stem changes in neural activity during static handgrip and postexercise ischemia in humans. Journal of Applied Physiology, 108(6), 1691–1700. https://doi.org/10.1152/japplphysiol.91539.2008
Schütt, H. H., Harmeling, S., Macke, J. H., & Wichmann, F. A. (2016). Painfree and accurate bayesian estimation of psychometric functions for (potentially) overdispersed data. Vision Research, 122, 105–123.
Toska, K. (2010). Handgrip contraction induces a linear increase in arterial pressure by peripheral vasoconstriction, increased heart rate and a decrease in stroke volume. Acta Physiologica, 200(3), 211–221. https://doi.org/10.1111/j.1748-1716.2010.02144.x
Verhaeghen, P., Steitz, D. W., Sliwinski, M. J., & Cerella, J. (2003). Aging and dual-task performance: A meta-analysis. Psychology and Aging, 18(3), 443–460. https://doi.org/10.1037/0882-7974.18.3.443
Wichmann, F. A., & Hill, N. J. (2001). The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & Psychophysics, 63(8), 1293–1313.
Wickens, C. D. (2008). Multiple resources and mental workload. Human Factors, 50(3), 449–455. https://doi.org/10.1518/001872008X288394
Woollacott, M., & Shumway-Cook, A. (2002). Attention and the control of posture and gait: A review of an emerging area of research. Gait & Posture, 16(1), 1–14. https://doi.org/10.1016/S0966-6362(01)00156-4
Yerkes, R. M., & Dodson, J. D. (1908). The relation of strength of stimulus to rapidity of habit-formation. Journal of Comparative Neurology and Psychology, 18(5), 459–482. https://doi.org/10.1002/cne.920180503